Skip to content

Use NTP admin API instead of sled agent, remove sled agent time_sync API #8597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jul 15, 2025

Builds on #8555

This PR:

  • Avoids the sled agent "zlogin to NTP zone" calls to chronyc, instead relying on the NTP admin server
  • Removes the "timesync_get" API exposed from the sled agent, and moves all callers (RSS only) to call the NTP admin server instead

Fixes #8590

@smklein smklein force-pushed the ntp-admin-use branch 2 times, most recently from c4faf10 to 8f65850 Compare July 15, 2025 17:51
@smklein smklein marked this pull request as ready for review July 15, 2025 19:53
@smklein smklein requested review from jgallagher and karencfv July 15, 2025 21:54
@smklein
Copy link
Collaborator Author

smklein commented Jul 15, 2025

FYI, I ran this on a4x2 - it didn't work, until I fixed an SMF typo, now it works. This is tested implicitly by the "helios / deploy" job, which relies on RSS, and also relies on the timesync API in the NTP admin server.

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@@ -1464,29 +1464,6 @@
}
}
},
"/timesync": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's any documentation that relies on checking the response from this endpoint in the logs to debug an NTP sync issue or something similar. I have a vague memory of something like that. If anyone remembers, that documentation should be updated. Perhaps an announcement on Matrix might be good as well? I've certainly used it before if an omicron deployment was taking too long to start.

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

FYI, I ran this on a4x2 - it didn't work, until I fixed an SMF typo, now it works. This is tested implicitly by the "helios / deploy" job, which relies on RSS, and also relies on the timesync API in the NTP admin server.

I don't see an SMF change in this PR - was it something a4x2-specific outside omicron?

let ntp_addresses: Vec<_> = service_plan
.services
.iter()
.flat_map(|(_, sled_config)| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kinda tempted to suggest we do something here to confirm we have exactly one NTP zone per sled (probably via asserting, since we came up with the plan and it should always be correct?). Maybe that would be overkill, since 0 or more-than-one NTP zone on a sled is going to cause problems pretty quickly anyway? Something like (using exactly_one() from itertools):

let ntp_addresses: Vec<_> = service_plan
    .services
    .iter()
    .map(|(_, sled_config)| {
        sled_config.zones.iter().filter_map(|zone_config| {
            // ...extract ntp_admin_addr...
        })
        .exactly_one()
        .expect("plan should specify one NTP zone per sled")
    })
    .collect();

Comment on lines +1379 to +1395
BlueprintZoneType::BoundaryNtp(
blueprint_zone_type::BoundaryNtp {
address, ..
},
) => {
let mut ntp_admin_addr = *address;
ntp_admin_addr.set_port(NTP_ADMIN_PORT);
Some(ntp_admin_addr)
}
BlueprintZoneType::InternalNtp(
blueprint_zone_type::InternalNtp { address },
) => {
let mut ntp_admin_addr = *address;
ntp_admin_addr.set_port(NTP_ADMIN_PORT);
Some(ntp_admin_addr)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both of these arms bind a field named address with the same type, I think you can combine them?

                        BlueprintZoneType::BoundaryNtp(
                            blueprint_zone_type::BoundaryNtp {
                                address, ..
                            },
                        ) 
                        | BlueprintZoneType::InternalNtp(
                            blueprint_zone_type::InternalNtp { address },
                        ) => {
                            let mut ntp_admin_addr = *address;
                            ntp_admin_addr.set_port(NTP_ADMIN_PORT);
                            Some(ntp_admin_addr)
                        }

@@ -724,17 +728,20 @@ impl ServiceInner {

async fn wait_for_timesync(
&self,
sled_addresses: &Vec<SocketAddrV6>,
ntp_admin_addresses: &Vec<SocketAddrV6>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is consistent with what was already here, but: by accepting a list of socket addrs, we end up rebuilding an NtpAdminClient for every sled inside every attempt of the retry_notify loop. Maybe we should take a &[NtpAdminClient] instead of a list of socket addrs, so we only create the clients once? (I have a vague memory of reqwest clients being relatively expensive to construct, but maybe we solved that some other way?)


let stdout = running_ntp_zone
.run_cmd(&["/usr/bin/chronyc", "-c", "tracking"])
.map_err(TimeSyncError::ExecuteChronyc)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Love getting rid of this.

@smklein
Copy link
Collaborator Author

smklein commented Jul 16, 2025

LGTM, thanks!

FYI, I ran this on a4x2 - it didn't work, until I fixed an SMF typo, now it works. This is tested implicitly by the "helios / deploy" job, which relies on RSS, and also relies on the timesync API in the NTP admin server.

I don't see an SMF change in this PR - was it something a4x2-specific outside omicron?

I pulled the SMF change into #8555

(<dependency name='oxide/ntp' causes the manifest importer to reject the whole service because in that context, it decides / is a character it doesn't like. I had to use <dependency name='ntp' to make it happy)

Base automatically changed from ntp-admin to main July 16, 2025 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants